Gradient Descent and Feature Engineering

In this project, I worked through the process of:

  1. Defining loss functions
  2. Feature engineering
  3. Minimizing loss functions using numerical and analytical methods
  4. Understanding what happens if we use the analytical solution for OLS on a matrix with redundant features
  5. Computing a gradient for a nonlinear model
  6. Using gradient descent to optimize the nonlinear model

By: Victor Cornejo





Loading the Tips Dataset

To begin, let's load the tips dataset from the seaborn library. This dataset contains records of tips, total bills, and information about the person who paid the bill. As before, we'll be trying to predict tips from the other data.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
import matplotlib.pyplot as plt
np.random.seed(42)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline
In [3]:
data = sns.load_dataset("tips")

print("Number of Records:", len(data))
data.head()
Number of Records: 244
Out[3]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4



Defining the Model and Feature Engineering

Moving on from the constant model, it was time for a more complicated model that used the other features in the dataset. I imagined that I might want to use the features with an equation of the form shown below:

$$ \text{Tip} = \theta_1 \cdot \text{total}\_\text{bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size} $$

Unfortunately, that's not possible because some of these features, like "day", are not numbers, so it doesn't make sense to multiply them by a numerical parameter.

Therefore, I started by converting some of these non-numerical values into numerical values. Before I did this, I separated out the tips and the features into two separate variables.

In [4]:
tips = data['tip']
X = data.drop(columns='tip')

Feature Engineering

First, I converted my features to numerical values. A straightforward approach would be to map each non-numerical feature directly onto a number.

For example, I could treat the day as a value from 1-7. However, one disadvantage of this direct translation is that it unintentionally assigns certain categories disproportionate weight. For example, if Sunday maps to 7 and Monday maps to 1, then for the same parameter value Sunday has seven times Monday's influence in the linear model, which can lower the accuracy of the model.
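A minimal sketch of this naive ordinal encoding (the day-to-number mapping here is hypothetical, not part of the dataset):

```python
import pandas as pd

days = pd.Series(["Sun", "Mon", "Sun", "Thur"])

# Hypothetical ordinal mapping: Monday = 1 through Sunday = 7.
day_to_num = {"Mon": 1, "Tues": 2, "Wed": 3, "Thur": 4,
              "Fri": 5, "Sat": 6, "Sun": 7}
ordinal = days.map(day_to_num)

# For a single coefficient theta, Sunday now contributes theta * 7
# while Monday contributes theta * 1 -- a 7x difference the categories
# don't actually warrant.
```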

Instead, I used one-hot encoding to better represent these features.

In the tips dataset, for example, I encoded Sunday as the vector [0 0 0 1] because the dataset only contains bills from Thursday through Sunday. This gives every category of a non-numeric feature its own indicator column, assigning a more even weight across categories. The code below one-hot encodes our dataset. The returned dataframe holds the "featurized" data, which is also often denoted by $\phi$.

In [5]:
pd.get_dummies(X["day"]).head()
Out[5]:
Thur Fri Sat Sun
0 0 0 0 1
1 0 0 0 1
2 0 0 0 1
3 0 0 0 1
4 0 0 0 1
In [6]:
def one_hot_encode(data):
    """
    Return the one-hot encoded dataframe of our input data.
    
    Parameters
    -----------
    data: a dataframe that may include non-numerical features
    
    Returns
    -----------
    A one-hot encoded dataframe that only contains numeric features
    
    """
    data = pd.get_dummies(data)
    return data
    
one_hot_X = one_hot_encode(X)
one_hot_X.head()
Out[6]:
total_bill size sex_Male sex_Female smoker_Yes smoker_No day_Thur day_Fri day_Sat day_Sun time_Lunch time_Dinner
0 16.99 2 0 1 0 1 0 0 0 1 0 1
1 10.34 3 1 0 0 1 0 0 0 1 0 1
2 21.01 3 1 0 0 1 0 0 0 1 0 1
3 23.68 2 1 0 0 1 0 0 0 1 0 1
4 24.59 4 0 1 0 1 0 0 0 1 0 1

Defining the Model

Once all of the data was numeric, I could begin to define the model function. Notice that after one-hot encoding the data, I now had 12 features instead of 6. Therefore, the linear model looked like:

$$ \text{Tip} = \theta_1 \cdot \text{size} + \theta_2 \cdot \text{total}\_\text{bill} + \theta_3 \cdot \text{day}\_\text{Thur} + \theta_4 \cdot \text{day}\_\text{Fri} + ... + \theta_{11} \cdot \text{time}\_\text{Lunch} + \theta_{12} \cdot \text{time}\_\text{Dinner} $$

I could represent the linear combination above as a matrix-vector product, and I implemented the linear_model function to evaluate this product.

Below, I created a MyLinearModel class with two methods, predict and fit. At this stage, fit is just a placeholder: it sets all 12 parameters to zero rather than doing anything useful.

In [8]:
class MyLinearModel():    
    def predict(self, X):
        return X @ self._thetas
    
    def fit(self, X, y):
        number_of_features = X.shape[1]
        self._thetas = np.zeros(shape = (number_of_features, 1))
In [9]:
model = MyLinearModel()
model.fit(one_hot_X, tips)
model._thetas
Out[9]:
array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])



Fitting a Linear Model using scipy.optimize.minimize Methods

I used the scipy.optimize.minimize function to implement the fit method of the linear model below.

I added a loss_function parameter so that the model can be fit with any desired loss function, not necessarily the L2 loss. Example loss functions are given below as l1 and l2.

In [10]:
X.shape[1]
Out[10]:
6
In [11]:
from scipy.optimize import minimize

def l1(y, y_hat):
    return np.abs(y - y_hat)

def l2(y, y_hat):
    return (y - y_hat)**2

class MyLinearModel():    
    def predict(self, X):
        return X @ self._thetas
    
    def fit(self, loss_function, X, y):
        """
        Produce the estimated optimal _thetas for the given loss function, 
        feature matrix X, and observations y.

        Parameters
        -----------
        loss_function: either the squared or absolute loss functions defined above
        X: a 2D dataframe (or numpy array) of numeric features (one-hot encoded)
        y: a 1D vector of tip amounts

        Returns
        -----------
        The estimate for the optimal theta vector that minimizes our loss
        """
        
        number_of_features = X.shape[1]

        # Start from an arbitrary guess of the correct length (all zeros),
        # then minimize the average loss of the predictions X @ theta,
        # measured with the given loss function relative to the observations y.
        starting_guess = np.zeros(number_of_features)
        self._thetas = minimize(
            lambda theta: np.mean(loss_function(y, X @ theta)),
            x0=starting_guess)['x']
        
model = MyLinearModel()
model.fit(l2, one_hot_X, tips)
model._thetas
Out[11]:
array([0.094487  , 0.1759912 , 0.18411073, 0.21655221, 0.15712765,
       0.24353564, 0.0151997 , 0.17746083, 0.05600925, 0.15199322,
       0.23440138, 0.16626186])

The MSE for the model above was just slightly larger than 1:

In [13]:
from sklearn.metrics import mean_squared_error
mean_squared_error(model.predict(one_hot_X), tips)
Out[13]:
1.010353561236632



Fitting the Model using Analytic Methods

I also fit the model analytically for the L2 loss function. Recall that with a linear model, least squares solves the following optimization problem:

$$\min_{\theta} ||\Bbb{X}\theta - \Bbb{y}||^2$$

When $\Bbb{X}^T\Bbb{X}$ is invertible, the optimal $\hat{\theta}$ is given by the normal equation: $\hat{\theta} = (\Bbb{X}^T\Bbb{X})^{-1}\Bbb{X}^T\Bbb{y}$
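As an aside, the same normal equation can be solved without forming an explicit inverse by handing $(\Bbb{X}^T\Bbb{X})\theta = \Bbb{X}^T\Bbb{y}$ to np.linalg.solve. A minimal sketch on toy synthetic data (not the tips dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))               # toy full-rank feature matrix
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.01, size=100)

# Solve (X^T X) theta = X^T y directly; same normal equation,
# but without explicitly inverting X^T X.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```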


Analytic Solution Using Explicit Inverses

For this part, I implemented the analytic solution above using np.linalg.inv to compute the inverse of $X^TX$.

In [14]:
class MyAnalyticallyFitOLSModel():    
    def predict(self, X):
        return X @ self._thetas
    
    def fit(self, X, y):
        """
        Sets _thetas using the analytical solution to the ordinary least squares problem

        Parameters
        -----------
        X: a 2D dataframe (or numpy array) of numeric features (one-hot encoded)
        y: a 1D vector of tip amounts

        Returns
        -----------
        The estimate for theta computed using the equation mentioned above
        """
        
        xTx = X.T @ X
        xTy = X.T @ y
        self._thetas = np.linalg.inv(xTx) @ xTy

        

The cell below attempts the analytical solution on the one-hot encoded tips dataset. It either raises a singular matrix error or produces nonsensical thetas (magnitudes greater than $10^{15}$). This is no bueno...

In [15]:
# When you run the code below, you should get back some non zero thetas.
        
model = MyAnalyticallyFitOLSModel()
model.fit(one_hot_X, tips)
analytical_thetas = model._thetas
analytical_thetas
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell In[15], line 4
      1 # When you run the code below, you should get back some non zero thetas.
      3 model = MyAnalyticallyFitOLSModel()
----> 4 model.fit(one_hot_X, tips)
      5 analytical_thetas = model._thetas
      6 analytical_thetas

Cell In[14], line 21, in MyAnalyticallyFitOLSModel.fit(self, X, y)
     19 xTx = X.T @ X # SOLUTIONx
     20 xTy = X.T @ y # SOLUTION
---> 21 self._thetas = np.linalg.inv(xTx) @ xTy

File <__array_function__ internals>:180, in inv(*args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/numpy/linalg/linalg.py:552, in inv(a)
    550 signature = 'D->D' if isComplexType(t) else 'd->d'
    551 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 552 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    553 return wrap(ainv.astype(result_t, copy=False))

File /srv/conda/envs/notebook/lib/python3.9/site-packages/numpy/linalg/linalg.py:89, in _raise_linalgerror_singular(err, flag)
     88 def _raise_linalgerror_singular(err, flag):
---> 89     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix

I got the error above when calculating the analytical solution for the one-hot encoded tips dataset because the matrix $X^TX$ was not invertible: each one-hot group (for example, sex_Male and sex_Female) sums to a column of ones, so the columns of $X$ are linearly dependent.
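This dependence can be confirmed with a rank check; the sketch below uses a hypothetical four-row toy encoding rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Each dummy pair sums to a column of ones, so the four columns
# are linearly dependent.
phi = pd.DataFrame({
    "sex_Male":   [1, 0, 1, 0],
    "sex_Female": [0, 1, 0, 1],
    "smoker_Yes": [1, 1, 0, 0],
    "smoker_No":  [0, 0, 1, 1],
})
rank = np.linalg.matrix_rank(phi.to_numpy())
# rank (3) is less than the number of columns (4),
# so phi.T @ phi is singular and has no inverse.
```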


Fixing the One-Hot Encoding

To fix the one-hot encoding approach and avoid the error above, I used the code below to one-hot encode the dataset so that one_hot_X_revised contains no redundant features.

In [16]:
X.head(5)
Out[16]:
total_bill sex smoker day time size
0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
In [17]:
pd.get_dummies(X)
Out[17]:
total_bill size sex_Male sex_Female smoker_Yes smoker_No day_Thur day_Fri day_Sat day_Sun time_Lunch time_Dinner
0 16.99 2 0 1 0 1 0 0 0 1 0 1
1 10.34 3 1 0 0 1 0 0 0 1 0 1
2 21.01 3 1 0 0 1 0 0 0 1 0 1
3 23.68 2 1 0 0 1 0 0 0 1 0 1
4 24.59 4 0 1 0 1 0 0 0 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
239 29.03 3 1 0 0 1 0 0 1 0 0 1
240 27.18 2 0 1 1 0 0 0 1 0 0 1
241 22.67 2 1 0 1 0 0 0 1 0 0 1
242 17.82 2 1 0 0 1 0 0 1 0 0 1
243 18.78 2 0 1 0 1 1 0 0 0 0 1

244 rows × 12 columns

In [18]:
pd.get_dummies(X, drop_first = True)
Out[18]:
total_bill size sex_Female smoker_No day_Fri day_Sat day_Sun time_Dinner
0 16.99 2 1 1 0 0 1 1
1 10.34 3 0 1 0 0 1 1
2 21.01 3 0 1 0 0 1 1
3 23.68 2 0 1 0 0 1 1
4 24.59 4 1 1 0 0 1 1
... ... ... ... ... ... ... ... ...
239 29.03 3 0 1 0 1 0 1
240 27.18 2 1 0 0 1 0 1
241 22.67 2 0 0 0 1 0 1
242 17.82 2 0 1 0 1 0 1
243 18.78 2 1 1 0 0 0 1

244 rows × 8 columns

In [19]:
def one_hot_encode_revised(data):
    """
    Return the one-hot encoded dataframe of our input data, removing redundancies.
    
    Parameters
    -----------
    data: a dataframe that may include non-numerical features
    
    Returns
    -----------
    A one-hot encoded dataframe that only contains numeric features without any redundancies.
    
    """
    one_hot_data = pd.get_dummies(data, drop_first = True)
    return one_hot_data
    

one_hot_X_revised = one_hot_encode_revised(X)    
    
numerical_model = MyLinearModel()
numerical_model.fit(l2, one_hot_X_revised, tips)
    
analytical_model = MyAnalyticallyFitOLSModel()
analytical_model.fit(one_hot_X_revised, tips)


print("The numerical model's loss is: ", mean_squared_error(numerical_model.predict(one_hot_X_revised), tips))
print("The analytical model's loss is: ", mean_squared_error(analytical_model.predict(one_hot_X_revised), tips))
The numerical model's loss is:  1.0332855709135256
The analytical model's loss is:  1.033285570878662

Analyzing the new One-Hot Encoding

By removing the redundant features, $X^TX$ became invertible, and the numerical and analytical models now reach essentially the same loss (about 1.033).



Gradient Descent

In [20]:
#Loading the Data
df = pd.read_csv("lab7_data.csv", index_col=0)
df.head()
Out[20]:
x y
0 -5.000000 -7.672309
1 -4.966555 -7.779735
2 -4.933110 -7.995938
3 -4.899666 -8.197059
4 -4.866221 -8.183883

Plotting the data, I could see that there was a clear sinusoidal relationship between x and y.

In [21]:
import plotly.express as px
px.scatter(df, x = "x", y = "y")

However, gradient descent is powerful enough to optimize even a nonlinear model. Specifically, I modeled the relationship in the data by:

$$\Large{ f_{\boldsymbol{\theta}}(x) = \theta_1x + \sin(\theta_2x) }$$

My model is parameterized by both $\theta_1$ and $\theta_2$, which can be represented in the vector $\boldsymbol{\theta}$.

Note that a general sine function $a\sin(bx+c)$ has three parameters: amplitude scaling parameter $a$, frequency parameter $b$ and phase shifting parameter $c$.

Here, I assumed the amplitude $a$ was around 1, and the phase shifting parameter $c$ was around zero. I did not attempt to justify this assumption.

I defined the sin_model function below that predicts $\textbf{y}$ (the $y$-values) using $\textbf{x}$ (the $x$-values) based on the new equation.

In [22]:
def sin_model(x, theta):
    """
    Predict the estimate of y given x, theta_1, theta_2

    Keyword arguments:
    x -- the vector of values x
    theta -- a vector of length 2, where theta[0] = theta_1 and theta[1] = theta_2
    """
    theta_1 = theta[0]
    theta_2 = theta[1]
    return theta_1 * x + np.sin(theta_2 * x)
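A quick check of sin_model with hypothetical parameter values (a self-contained copy of the function above):

```python
import numpy as np

def sin_model(x, theta):
    return theta[0] * x + np.sin(theta[1] * x)

x = np.array([0.0, np.pi])
theta = np.array([1.5, 1.0])   # hypothetical theta_1, theta_2

# At x = 0 the prediction is 0; at x = pi it is 1.5*pi + sin(pi) = 1.5*pi.
y_hat = sin_model(x, theta)
```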

Computing the Gradient of the MSE With Respect to Theta on the Sin Model

$\hat{\theta}$ is the value of $\theta$ that minimizes the loss function. One way of solving for $\hat{\theta}$ could be by computing the gradient of the loss function with respect to $\theta$.

  • $L(\textbf{x}, \textbf{y}, \theta_1, \theta_2)$: the loss function, the mean squared error
  • $\frac{\partial L }{\partial \theta_1}$: the partial derivative of $L$ with respect to $\theta_1$
  • $\frac{\partial L }{\partial \theta_2}$: the partial derivative of $L$ with respect to $\theta_2$

  • $L(\textbf{x}, \textbf{y}, \theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (\textbf{y}_i - \hat{\textbf{y}}_i)^2$
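Since $\hat{\textbf{y}}_i = \theta_1 \textbf{x}_i + \sin(\theta_2 \textbf{x}_i)$, the chain rule gives:

$$\frac{\partial L}{\partial \theta_1} = \frac{1}{n} \sum_{i=1}^{n} -2\textbf{x}_i\left(\textbf{y}_i - \hat{\textbf{y}}_i\right)$$

$$\frac{\partial L}{\partial \theta_2} = \frac{1}{n} \sum_{i=1}^{n} -2\textbf{x}_i\cos(\theta_2\textbf{x}_i)\left(\textbf{y}_i - \hat{\textbf{y}}_i\right)$$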

Specifically, the functions sin_MSE, sin_MSE_dt1 and sin_MSE_dt2 compute $L$, $\frac{\partial L }{\partial \theta_1}$ and $\frac{\partial L }{\partial \theta_2}$ respectively, using the expressions I wrote for the two partial derivatives. In the functions below, the parameter theta is a vector that looks like $\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$. The function sin_MSE_gradient calls sin_MSE_dt1 and sin_MSE_dt2 and returns the gradient as a single vector dt.

In [24]:
def sin_MSE(theta, x, y):
    """
    Compute the numerical value of the l2 loss of our sinusoidal model given theta

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    return np.mean((y - sin_model(x, theta)) ** 2)


def sin_MSE_dt1(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_1

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    return np.mean(-2 * x * (y - sin_model(x, theta)))

    
def sin_MSE_dt2(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_2

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    return np.mean(-2 * x * np.cos(theta[1] * x) * (y - sin_model(x, theta)))
    
# This function calls sin_MSE_dt1 and sin_MSE_dt2 and returns the gradient as a single vector.
def sin_MSE_gradient(theta, x, y):
    """
    Returns the gradient of l2 loss with respect to vector theta

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    return np.array([sin_MSE_dt1(theta, x, y), sin_MSE_dt2(theta, x, y)])
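As a sanity check that is not part of the original lab, the analytic gradient can be compared against a centered finite-difference approximation; the model and loss functions are repeated here so the sketch is self-contained:

```python
import numpy as np

def sin_model(x, theta):
    return theta[0] * x + np.sin(theta[1] * x)

def sin_MSE(theta, x, y):
    return np.mean((y - sin_model(x, theta)) ** 2)

def sin_MSE_gradient(theta, x, y):
    resid = y - sin_model(x, theta)
    return np.array([np.mean(-2 * x * resid),
                     np.mean(-2 * x * np.cos(theta[1] * x) * resid)])

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
y = 1.5 * x + np.sin(3 * x) + rng.normal(scale=0.1, size=50)
theta = np.array([1.0, 2.5])

# Centered finite differences: perturb one component of theta at a time.
eps = 1e-6
fd = np.array([(sin_MSE(theta + eps * np.eye(2)[i], x, y) -
                sin_MSE(theta - eps * np.eye(2)[i], x, y)) / (2 * eps)
               for i in range(2)])
analytic = sin_MSE_gradient(theta, x, y)
```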

Implementing Gradient Descent and Using It to Optimize the Sin Model

In [25]:
def init_theta():
    """Creates an initial theta [0, 0] of shape (2,) as a starting point for gradient descent"""
    return np.array([0, 0])

def grad_desc(loss_f, gradient_loss_f, theta, data, num_iter=20, alpha=0.1):
    """
    Run gradient descent update for a finite number of iterations and static learning rate

    Keyword arguments:
    loss_f -- the loss function to be minimized (used for computing loss_history)
    gradient_loss_f -- the gradient of the loss function to be minimized
    theta -- the vector of values theta to use at first iteration
    data -- the data used in the model 
    num_iter -- the max number of iterations
    alpha -- the learning rate (also called the step size)
    
    Return:
    theta -- the optimal value of theta after num_iter of gradient descent
    theta_history -- the series of theta values over each iteration of gradient descent
    loss_history -- the series of loss values over each iteration of gradient descent
    """
    theta_history = []
    loss_history = []
    
    for i in range(num_iter):
        theta_history.append(theta)
        loss_history.append(loss_f(theta, data['x'], data['y']))
        theta = theta - alpha * gradient_loss_f(theta, data['x'], data['y'])
        
    return theta, theta_history, loss_history

theta_start = init_theta()
theta_hat, thetas_used, losses_calculated = grad_desc(
    sin_MSE, sin_MSE_gradient, theta_start, df, num_iter=20, alpha=0.1
)
for b, l in zip(thetas_used, losses_calculated):
    print(f"theta: {b}, Loss: {l}")
theta: [0 0], Loss: 20.859191416422235
theta: [2.60105745 2.60105745], Loss: 9.285008173048666
theta: [0.90342728 2.59100602], Loss: 4.680169273815357
theta: [2.05633644 2.9631291 ], Loss: 2.6242517936325833
theta: [1.15892347 2.86687431], Loss: 1.4765157174727774
theta: [1.79388042 3.07275573], Loss: 0.9073271435862448
theta: [1.32157494 3.00146569], Loss: 0.541531643291128
theta: [1.64954491 3.02910866], Loss: 0.3775841142469479
theta: [1.42325294 2.98820303], Loss: 0.2969750688130759
theta: [1.58295041 3.01033846], Loss: 0.2590425421375732
theta: [1.47097255 2.98926519], Loss: 0.23973439443291833
theta: [1.55040965 3.0017442 ], Loss: 0.23034782416254634
theta: [1.49439132 2.99135194], Loss: 0.2255775832667724
theta: [1.5341564  2.99797824], Loss: 0.22321772191904068
theta: [1.50603995 2.99286671], Loss: 0.22202363967204045
theta: [1.52598919 2.99628665], Loss: 0.22142811500262397
theta: [1.51186655 2.99375531], Loss: 0.22112776381775168
theta: [1.52188208 2.99549617], Loss: 0.22097741373654575
theta: [1.51478773 2.99423497], Loss: 0.22090173185683037
theta: [1.51981739 2.99511516], Loss: 0.2208637810584589

Now, I visually inspected the results of running gradient descent to optimize $\boldsymbol\theta$. The code below plots the model's predicted $\hat{y}$-values over the original scatter plot; we can see that gradient descent successfully optimized $\boldsymbol\theta$.

In [26]:
theta_init = init_theta()

theta_est, thetas, loss = grad_desc(sin_MSE, sin_MSE_gradient, theta_init, df)

Plotting the model output over the observations shows that gradient descent did a great job finding both the overall increase (slope) of the data, as well as the oscillation frequency.

In [27]:
x, y = df['x'], df['y']
y_pred = sin_model(x, theta_est)

plt.plot(x, y_pred, label=r'Model ($\hat{y}$)')
plt.scatter(x, y, alpha=0.5, label='Observation ($y$)', color='gold')
plt.legend();



Visualizing Loss

I also visualized the loss surface to gain some insight into how gradient descent optimizes the model parameters.

The printed losses above showed the loss decreasing with each iteration. In this part, I display the trajectory of the algorithm as it travels across the loss surface.

In [28]:
thetas = np.array(thetas).squeeze()
loss = np.array(loss)
thetas
Out[28]:
array([[0.        , 0.        ],
       [2.60105745, 2.60105745],
       [0.90342728, 2.59100602],
       [2.05633644, 2.9631291 ],
       [1.15892347, 2.86687431],
       [1.79388042, 3.07275573],
       [1.32157494, 3.00146569],
       [1.64954491, 3.02910866],
       [1.42325294, 2.98820303],
       [1.58295041, 3.01033846],
       [1.47097255, 2.98926519],
       [1.55040965, 3.0017442 ],
       [1.49439132, 2.99135194],
       [1.5341564 , 2.99797824],
       [1.50603995, 2.99286671],
       [1.52598919, 2.99628665],
       [1.51186655, 2.99375531],
       [1.52188208, 2.99549617],
       [1.51478773, 2.99423497],
       [1.51981739, 2.99511516]])
In [29]:
from lab7_utils import plot_3d
plot_3d(thetas[:, 0], thetas[:, 1], loss, mean_squared_error, sin_model, x, y)
In [30]:
import plotly
import plotly.graph_objs as go
In [31]:
def contour_plot(title, theta_history, loss_function, model, x, y):
    """
    The function takes the following as argument:
        theta_history: a (N, 2) array of theta history
        loss: a list or array of loss value
        loss_function: for example, l2_loss
        model: for example, sin_model
        x: the original x input
        y: the original y output
    """
    theta_1_series = theta_history[:,0] # a list or array of theta_1 value
    theta_2_series = theta_history[:,1] # a list or array of theta_2 value

    ## In the following block of code, I generate the z value
    ## across a 2D grid
    theta1_s = np.linspace(np.min(theta_1_series) - 0.1, np.max(theta_1_series) + 0.1)
    theta2_s = np.linspace(np.min(theta_2_series) - 0.1, np.max(theta_2_series) + 0.1)

    x_s, y_s = np.meshgrid(theta1_s, theta2_s)
    data = np.stack([x_s.flatten(), y_s.flatten()]).T
    ls = []
    for theta1, theta2 in data:
        l = loss_function(model(x, np.array([theta1, theta2])), y)
        ls.append(l)
    z = np.array(ls).reshape(50, 50)
    
    # Create trace of theta point
    # Create the contour 
    theta_points = go.Scatter(name="theta Values", 
                              x=theta_1_series, 
                              y=theta_2_series,
                              mode="lines+markers")
    lr_loss_contours = go.Contour(x=theta1_s, 
                                  y=theta2_s, 
                                  z=z, 
                                  colorscale='Viridis', reversescale=True)

    plotly.offline.iplot(go.Figure(data=[lr_loss_contours, theta_points], layout={'title': title}))
In [32]:
contour_plot('Gradient Descent with Static Learning Rate', thetas, mean_squared_error, sin_model, df["x"], df["y"])

As we can see, gradient descent is able to navigate even this fairly complex loss space and find a nice minimum.